RStudio exercise 4: Clustering and classification

Introduction to the data

In this exercise we use the Boston data set from the MASS library. The data set contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It includes 14 variables and 506 rows.
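As a minimal sketch, the dimensions and structure listing could be produced with the calls below. It assumes only that the MASS package is installed, which is the case for standard R installations.

```r
# MASS contains the Boston housing data.
library(MASS)
data("Boston")

dim(Boston)   # number of rows and columns
str(Boston)   # type and first values of every variable
```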

## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
variable  description
crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq.ft.
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitrogen oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted mean of distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per $10,000
ptratio   pupil-teacher ratio by town
black     1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat     lower status of the population (percent)
medv      median value of owner-occupied homes in $1000s

Graphical overview of the data

Plot matrix of the data

There are some very interesting distributions of variables in the plot matrix. The variable rad takes mostly either low or high values, so its observations are concentrated at the two ends of its axis. Variable *

Plotted correlation matrix

The plotted correlation matrix shows that some variables are highly correlated:

  • The correlation between industrial areas (indus) and nitrogen oxides (nox) is quite clear: industry adds pollution to the area. Industry also seems to correlate with variables like age, dis, rad and tax.

  • The correlations of nitrogen oxides (nox) are very similar to those of industry (indus).

  • Crime rate (crim) seems to correlate with good accessibility to radial highways (rad) and the property tax rate (tax).

  • The proportion of old houses (age) and the distance to employment centres (dis) are also correlated.
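The correlations behind these observations can be checked numerically. The sketch below uses base R's cor() on the raw data; it does not reproduce the plot itself, and the object name cor_matrix is only illustrative.

```r
library(MASS)
data("Boston")

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(Boston), 2)

# Correlations of the variable pairs discussed in the list above.
cor_matrix["indus", c("nox", "age", "dis", "rad", "tax")]
cor_matrix["crim", c("rad", "tax")]
cor_matrix["age", "dis"]
```

A negative value means that one variable tends to decrease as the other increases.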

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scaled data

All the variables are numerical, so we can use the scale() function to scale the whole data set.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
## [1] "matrix"

Scaling centres each variable at zero and divides it by its standard deviation, so all variables end up on a comparable scale. Before scaling, the values of variables like black and tax were hundreds of times larger than those of some other variables.
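A short sketch of the scaling step follows. The names boston_scaled and tax_by_hand are illustrative; the hand computation for tax is included only to show what scale() does to each column.

```r
library(MASS)
data("Boston")

# scale() subtracts each column's mean and divides by its standard deviation.
boston_scaled <- scale(Boston)
class(boston_scaled)   # scale() returns a matrix, not a data frame

# The same transformation computed by hand for one column.
tax_by_hand <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)
all.equal(as.numeric(boston_scaled[, "tax"]), tax_by_hand)

# Convert back to a data frame for the later modelling steps.
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
```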

Creating a new categorical variable crime

The variable crim is the basis of the new categorical variable crime.

category  quantile points
low       0%-25%
med_low   25%-50%
med_high  50%-75%
high      75%-100%
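One way to build such a variable is to cut the scaled crim at its quantiles. This is a sketch under the assumption that cut() and quantile() are used; the object names are illustrative.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# The quantiles of the scaled crime rate serve as break points.
bins <- quantile(boston_scaled$crim)

# include.lowest = TRUE keeps the minimum value in the first class.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the numeric crim column with the new factor.
boston_scaled <- boston_scaled[, names(boston_scaled) != "crim"]
boston_scaled$crime <- crime
```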

Quantile points of the variable crim

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
## crime
##      low  med_low med_high     high 
##      127      126      126      127
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865

Train and test sets

The training set contains 80% of the data; the remaining 20% forms the test set.
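A random 80/20 split can be sketched as follows. The seed is a hypothetical choice made only so that the example is reproducible, so the sampled row numbers will differ from the ones printed below.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                          # hypothetical seed, for reproducibility only
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))   # row numbers of the training observations

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(train = nrow(train), test = nrow(test))
```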

##   [1] 203 454 483 224 422 488 405 415 272 232 125 159 356 120 429  70  90
##  [18] 457 277 424  57 248 312 481 498 320 455 298 102  52  88  77  31 278
##  [35] 284 132 292 418 438 191 451 139 386  61 305 448 219 476 473 231 400
##  [52] 247   1  15 370 453 252 357 140 301 391 273 122 146 109 244 101  56
##  [69] 426  68 490 254 393 367 472 462 243 368 327 447 192  82 442 332 469
##  [86] 342 353  30 313 328 309 299 485 484 169 148 311 323   4 156 374 187
## [103]   8 343 268  84 420   5 395 325 470 465 174 168 149  69 456 220 337
## [120] 121 413 266  48   3 246 468 458 234 329 387 230 364 213 256 355  46
## [137] 276  65 445 105 211 251 396 160 359 270 280 466 171 435 151  18  67
## [154] 408 199 491  14 188  37 427  66 184 315  20  74  72 103  28 218 348
## [171] 503 414  36 228 100 324 341 242 384 233 106 340 296 108  41 399 204
## [188] 158  25 162 432  75 182 322 200 223 236 354 124 249  98  63 303 215
## [205]  19 440 142  76 346   2 423 336 245 390  29 496 181  58 258 206 119
## [222]  86 330  26  96 289 180 372 163 385 389 153 195  12 388 439 170  51
## [239]  39 495 373 401 428 380 326  23 339 317 111   7 185 409 290 177 394
## [256] 378 437 260 471 350 126  80 333 492 314 371 352 486 197  81  44 331
## [273]   9 172  13  93 302 361 239 287 307 434 128 144 238  62 198 241 407
## [290] 285 397  79 467 433 304 392 318 477 382 253 358 150 449 479 176 250
## [307] 235 344  55 497 499 216 482  24 217 275 417 201 294  97  99 186 446
## [324] 210 406  22 494 504 282 240 281 179 274 141 147 129 381 114 152  50
## [341]  59 334 116 441 288 135 205 493 295 178 107  34 316 376 112 319  54
## [358] 209 403 404 286 196  42 460 506 193 474 202 255 345  35  45  60  53
## [375] 154 104 118 487 138 489 166 269 338 377 115 369 500 237  21  95  17
## [392]   6 450 183  78 267  85 505  89  87 425 464  27  10

Fitting the Linear Discriminant Analysis

First, linear discriminant analysis (LDA) is fitted to the training set. The new categorical variable crime is the target variable and all the other variables of the data set are predictors.

After fitting, we draw the LDA biplot with arrows.
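The fit and the biplot could look roughly like the sketch below. The lda() call matches the one shown in the output; the seed and the helper lda_arrows() are hypothetical and serve only as an illustration of how arrows can be drawn from the LD1 and LD2 coefficients.

```r
library(MASS)
data("Boston")

# Rebuild the scaled data with the categorical crime variable.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)   # hypothetical seed
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# crime is the target; every other column is a predictor.
lda.fit <- lda(crime ~ ., data = train)

# Hypothetical helper: draws each predictor's LD1/LD2 coefficients as an arrow.
lda_arrows <- function(fit, scale_factor = 2, colour = "red") {
  heads <- fit$scaling[, 1:2] * scale_factor
  arrows(x0 = 0, y0 = 0, x1 = heads[, 1], y1 = heads[, 2],
         col = colour, length = 0.1)
  text(heads, labels = rownames(heads), col = colour, pos = 3, cex = 0.8)
}

classes <- as.numeric(train$crime)   # class index used for colours and symbols
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda_arrows(lda.fit)
```

Long arrows mark the predictors with the largest coefficients on the first two discriminants.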

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2648515 0.2252475 0.2475248 
## 
## Group means:
##                   zn      indus         chas        nox          rm
## low       0.94651463 -0.9183456 -0.123759247 -0.8754984  0.44269466
## med_low  -0.07850487 -0.2790180  0.022033567 -0.5628294 -0.12746575
## med_high -0.40996711  0.1550927  0.030524797  0.3389089 -0.06787385
## high     -0.48724019  1.0171519  0.003267949  1.0392419 -0.41965157
##                 age        dis        rad        tax     ptratio
## low      -0.8680064  0.9107230 -0.6936710 -0.7205813 -0.48494210
## med_low  -0.3188371  0.3661931 -0.5439511 -0.4562280 -0.02651464
## med_high  0.3174192 -0.2891248 -0.3962792 -0.2926659 -0.18960230
## high      0.7791114 -0.8477139  1.6377820  1.5138081  0.78037363
##                black       lstat         medv
## low       0.37963993 -0.75582641  0.527568166
## med_low   0.31219254 -0.14691625 -0.008139779
## med_high  0.06148452  0.01642143  0.058803298
## high     -0.75752488  0.89880710 -0.676822429
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.112563363  0.82215648 -0.73652558
## indus   -0.040362373 -0.34248126  0.38224810
## chas     0.007063488 -0.01113207  0.26493627
## nox      0.446663016 -0.68030453 -1.61410171
## rm       0.051985099  0.07864996 -0.05682784
## age      0.244295125 -0.26014105 -0.07571549
## dis     -0.094134085 -0.41383725  0.10338826
## rad      3.172360636  0.89170457 -0.02517781
## tax      0.001634071  0.02359935  0.53513182
## ptratio  0.124685795  0.01143191 -0.15060181
## black   -0.106181682  0.02030843  0.10161962
## lstat    0.198462827 -0.17032902  0.41705915
## medv     0.061429329 -0.36218939 -0.26170105
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9563 0.0317 0.0120
##   [1] 1 4 4 3 4 4 4 4 2 3 2 3 2 2 4 2 1 4 2 4 1 2 3 4 3 3 4 2 2 1 1 2 3 1 1
##  [36] 3 1 4 4 2 4 2 4 2 1 4 2 4 3 3 4 3 1 3 4 4 2 4 3 1 4 2 1 3 2 2 2 1 4 1
##  [71] 2 3 4 4 4 4 2 4 3 4 1 1 4 1 4 1 1 3 3 2 3 1 3 3 3 3 3 3 1 3 4 1 2 1 3
## [106] 1 4 1 4 3 4 4 2 3 3 2 4 2 1 1 4 3 2 1 2 4 4 3 1 4 3 4 2 1 1 2 2 1 4 2
## [141] 2 2 4 3 4 2 2 3 3 4 3 3 1 4 1 2 3 1 2 4 1 2 3 3 2 2 2 3 1 1 1 4 1 3 1
## [176] 3 1 2 4 3 2 1 2 2 1 4 1 3 3 3 4 1 1 2 1 3 3 1 2 2 2 2 2 3 3 4 3 2 1 1
## [211] 4 1 2 4 3 2 1 1 3 2 2 1 1 3 2 1 1 4 3 4 4 3 1 2 4 4 3 2 2 3 4 4 4 4 2
## [246] 3 1 3 2 2 2 4 1 1 4 4 4 3 4 1 2 2 1 2 3 4 1 3 1 1 2 1 2 3 2 1 1 4 2 1
## [281] 1 4 3 4 3 2 1 2 4 1 4 1 4 4 2 4 2 4 4 2 4 3 4 4 1 2 3 1 1 3 2 2 4 3 1
## [316] 1 4 1 2 2 1 1 4 3 4 3 2 1 1 2 1 1 2 3 3 3 4 2 3 2 2 1 2 4 1 3 1 2 1 1
## [351] 2 3 2 4 2 3 1 2 4 4 1 1 2 4 1 2 4 1 1 1 3 2 2 1 3 2 2 4 3 2 3 3 1 4 2
## [386] 4 2 3 3 1 3 1 4 2 2 3 1 2 1 1 4 4 3 2

Predicting the classes

##           predicted
## correct    low med_low med_high high
##   low       13       8        0    0
##   med_low    2      15        2    0
##   med_high   1       9       24    1
##   high       0       0        0   27

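The predictions were fairly good: 79 of the 102 test observations (about 77%) were classified correctly. The high class was predicted perfectly (27 of 27). Most errors fall between neighbouring classes in the lower and middle part of the range: 8 of the 21 low observations were predicted as med_low, and 9 of the 35 med_high observations were predicted as med_low. Only one observation, a med_high one, was wrongly predicted as high.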
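The accuracy figures can be recomputed directly from the cross-tabulation above. The sketch below re-enters the table by hand; with a fitted model, a table of this kind would typically come from table() applied to the true classes and the classes returned by predict(), which is an assumption about how it was produced here.

```r
# Cross-tabulation copied from the output above
# (rows: correct class, columns: predicted class).
classes <- c("low", "med_low", "med_high", "high")
confusion <- matrix(c(13,  8,  0,  0,
                       2, 15,  2,  0,
                       1,  9, 24,  1,
                       0,  0,  0, 27),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(correct = classes, predicted = classes))

accuracy  <- sum(diag(confusion)) / sum(confusion)   # overall share correct
per_class <- diag(confusion) / rowSums(confusion)    # share correct per true class

round(accuracy, 3)
round(per_class, 2)
```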

K-means algorithm

I’m going to find the optimal number of clusters for the Boston data. First I reload and scale the data. The variables need to be scaled so that the distances between observations are comparable.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Next I calculate the distances between observations and determine the number of clusters.

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. The total WCSS was calculated for 1 to 15 clusters. A good number of clusters is the point where the total WCSS drops sharply. In this case the optimal number of clusters seems to be two. However, we are here to learn, so I decided to analyse a model with four clusters.
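The distance calculation and the WCSS curve can be sketched as follows. The seed is hypothetical, and because k-means starts from random centres the exact values may vary between runs.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distances between all pairs of observations.
dist_eu <- dist(boston_scaled)
summary(dist_eu)

set.seed(123)   # hypothetical seed; k-means starts from random centres
k_max <- 15

# Total within-cluster sum of squares for k = 1, ..., k_max.
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")
```

The number of clusters is then read from the plot as the point where the curve bends most sharply.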

After determining the number of clusters I run the K-means algorithm again.

When the data is divided into four clusters, there are some clear differences in the distributions of several variables. Crim, zn, indus and black are variables where one can distinguish clear patterns between clusters. For some variables (rad and tax), one or two clusters form a clear distribution, but the observations of the other two clusters are ambiguous and no clear pattern can be recognised.
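A four-cluster run and a scatter plot matrix coloured by cluster might be sketched like this. The seed and the selection of plotted variables are assumptions made for the example; cluster numbering and sizes depend on the random start.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                        # hypothetical seed
km <- kmeans(boston_scaled, centers = 4)
table(km$cluster)                    # cluster sizes

# Scatter plot matrix of the variables discussed above, coloured by cluster.
pairs(boston_scaled[, c("crim", "zn", "indus", "black", "rad", "tax")],
      col = km$cluster, pch = 19, cex = 0.5)
```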

BONUS: LDA using clusters as target classes

After loading the Boston dataset I scale it to get comparable distances.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv             clust      
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:2.000  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449   Median :3.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   :2.674  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:3.000  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865   Max.   :4.000

The original Boston data set is now scaled and the result of the K-means clustering is saved in the variable clust.

LDA with the clusters

Next the LDA is performed and the biplot with arrows is created.

## Call:
## lda(clust ~ ., data = scaled_Boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2114625 0.1304348 0.4308300 0.2272727 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
## 2  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
## 3 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 4  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
##          age        dis        rad        tax     ptratio       black
## 1 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
## 2  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
## 3 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 4  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
##        lstat        medv
## 1 -0.8200275  1.11919598
## 2  1.1930953 -0.81904111
## 3 -0.2813666 -0.01314324
## 4  0.6116223 -0.54636549
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim     0.18113078 -0.5012256 -0.60535205
## zn       0.43297497 -1.0486194  0.67406151
## indus    1.37753200  0.3016928  1.07034034
## chas    -0.04307937 -0.7598229 -0.22448239
## nox      1.04674638 -0.3861005 -0.33268952
## rm      -0.14912869 -0.1510367  0.67942589
## age     -0.09897424  0.0523110  0.26285587
## dis      0.13139210 -0.1593367 -0.03487882
## rad      0.65824136  0.5189795  0.48145070
## tax      0.28903561 -0.5773959  0.10350513
## ptratio  0.22236843  0.1668597 -0.09181715
## black   -0.42730704  0.5843973  0.89869354
## lstat    0.24320629 -0.6197780 -0.01119242
## medv     0.21961575 -0.9485829 -0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636

The biplot shows that the variables indus, zn and medv are the most influential separators for the clusters.
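The most influential variables can also be read from the coefficient table by ranking the absolute coefficients. The sketch below refits the model under a hypothetical seed, so its clusters need not match the ones above. The guard that drops predictors with no variation inside the clusters is an addition for robustness, not part of the original analysis.

```r
library(MASS)
data("Boston")
scaled_Boston <- as.data.frame(scale(Boston))

set.seed(123)                                   # hypothetical seed
clust <- kmeans(scaled_Boston, centers = 4)$cluster

# lda() stops if a predictor is constant within every group (this could happen
# to the binary chas if one cluster held exactly the chas = 1 tracts), so such
# columns are dropped first.
within_sd <- sapply(scaled_Boston, function(v) sd(v - ave(v, clust)))
predictors <- scaled_Boston[, within_sd > 1e-4]

# The cluster label is the target class.
lda_clust <- lda(x = as.matrix(predictors), grouping = factor(clust))

# Rank the predictors by the size of their LD1 and LD2 coefficients.
coefs <- lda_clust$scaling
head(sort(abs(coefs[, "LD1"]), decreasing = TRUE), 3)
head(sort(abs(coefs[, "LD2"]), decreasing = TRUE), 3)
```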

Super-bonus

3D plot where the color of the observations shows the crime classes of the training set

3D plot where the color of the observations is based on the K-means clusters

The colors in both plots are based on four classes. The K-means plot seems to show the different clusters more clearly than the plot based on the crime classification.
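The coordinates for such 3D plots can be obtained by projecting the training observations onto the three linear discriminants. The sketch below makes several assumptions: the seed is hypothetical, the clusters come from K-means on the full scaled data, and plotly is used for the 3D scatter plots only if it happens to be installed.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                                        # hypothetical seed
clusters <- kmeans(boston_scaled, centers = 4)$cluster

crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
predictors <- boston_scaled[, names(boston_scaled) != "crim"]

ind <- sample(nrow(predictors), size = floor(0.8 * nrow(predictors)))
lda.fit <- lda(x = as.matrix(predictors[ind, ]), grouping = crime[ind])

# Coordinates of the training observations on LD1, LD2 and LD3
# (a plain matrix product; unlike predict(), it does not centre the scores).
projected <- as.data.frame(as.matrix(predictors[ind, ]) %*% lda.fit$scaling)

# plotly is an assumption; the plots are only built if the package is available.
if (requireNamespace("plotly", quietly = TRUE)) {
  p_crime <- plotly::plot_ly(x = projected$LD1, y = projected$LD2, z = projected$LD3,
                             type = "scatter3d", mode = "markers",
                             color = crime[ind])
  p_clust <- plotly::plot_ly(x = projected$LD1, y = projected$LD2, z = projected$LD3,
                             type = "scatter3d", mode = "markers",
                             color = factor(clusters[ind]))
}
```

Printing p_crime or p_clust in an interactive session displays the corresponding plot.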